Integrated analysis of variants and pathways in genome-wide 
association studies using polygenic models of disease 
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Abstract 

Many common diseases are highly polygenic, modulated by 
a large number genetic factors with small effects on suscep- 
tibility to disease. These small effects are difficult to map 
reliably in genetic association studies. To address this prob- 
lem, researchers have developed methods that aggregate in- 
formation over sets of related genes, such as biological path- 
ways, to identify gene sets that are enriched for genetic vari- 
ants associated with disease. However, these methods fail to 
answer a key question: which genes and genetic variants are 
associated with disease risk? We develop a method based 
on sparse multiple regression that simultaneously identifies 
enriched pathways, and prioritizes the variants within these 
pathways, to locate additional variants associated with dis- 
ease susceptibility. A central feature of our approach is 
an estimate of the strength of enrichment, which yields a 
coherent way to prioritize variants in enriched pathways. 
We illustrate the benefits of our approach in a genome- 
wide association study of Crohn's disease with ~440,000 
genetic variants genotyped for ~4700 study subjects. We 
obtain strong support for enrichment of IL-12, IL-23 and 
other cytokine signaling pathways. Furthermore, prioritiz- 
ing variants in these enriched pathways yields support for 
additional disease-association variants, all of which have 
been independently reported in other case-control studies 
for Crohn's disease. 



Author Summary 

Genome-wide association studies have helped locate genes 
that contribute to common diseases. The analysis of these 
studies is typically straightforward: systematically test each 
genetic variation in isolation whether it is correlated with 
susceptibility to disease. This approach often works well to 
identify commonly occurring variants with moderate effects 
on disease risk, but the effects of many variants are so small 
they fail to show statistically significant correlations. This 
is a concern because many common diseases arc modulated 
by a large number of genetic factors with small effects on 
disease risk. An alternative strategy is to examine groups 
of variants, such as variants sharing a common biological 
pathway, and to test whether each group is enriched for 
disease correlations. This can be a more effective approach 
to identify genetic factors relevant to disease. However, it 



does not indicate which individual genes or variants are 
associated with disease, which remains an important ques- 
tion. To address this limitation, we describe a statistical 
framework that integrates enrichment analysis with disease- 
correlation tests for variants. We illustrate this approach in 
a case-control study for Crohn's disease. We show that our 
approach uncovers disease-susceptibility genes that are not 
identified in conventional analyses of the same data. 

Introduction 

By surveying genetic variation throughout the genome, and 
systematically searching for variants correlated with disease 
phenotypes, genome-wide association studies (GWAS) have 
led to the discovery of genes and genetic loci that were not 
previously suspected of playing a role in complex diseases 
pHl] . For example, the discovery of disease-correlated vari- 
ants in GWAS of Crohn's disease, a common form of inflam- 
matory bowel disease, has helped draw links to genes that 
regulate autophagy and innate immune responses [sj-Hoj. 

In most analyses of GWAS, genetic associations are 
identified by testing each marker one at a time for asso- 
ciation with disease. Additional clues about genetic vari- 
ants, such as whether they are coding, exonic, or lie near 
the transcription start site of a gene, are usually consid- 
ered post hoc, rather than incorporated into the statistical 
analysis. While such "hypothesis-free" analyses have suc- 
cessfully identified variants associated with disease, they 
have important shortcomings. Many commonly occurring 
diseases are believed to be modulated by a large number 
of genetic factors, each which have a small effect on dis- 
ease risk, so they may be difficult to identify in a GWAS 
pdl - [T3 |. This problem is compounded, according to some 
predictions (TJ IS EM Ell > by the low prevalence of many 
alleles conferring risk to complex diseases. Motivated by 
these shortcomings, researchers have developed "pathway 
analysis" approaches to GWAS [l~7l - [2ll | . These methods are 
grounded on the theory that complex disease arises from the 
accumulation of gene tic effects acting within common bio- 
logical pathways [22l - l2a ] . A main goal of pathway analysis 
is to identify pathways that are enriched for disease — that 
is, groups of related genes that are more likely to harbour 
disease-associated variants compared to arbitrary regions 
of the genome. Pathway analysis can improve power to un- 
cover genetic factors relevant to disease because identifying 
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the accumulation of small genetic effects acting in a com- 
mon pathway is often easier than mapping the individual 
genes within the pathway that contribute to disease sus- 
ceptibility 

A limitation of analyses that identify enriched path- 
ways is that they do not provide feedback about associated 
variants within these pathways; identifying an enriched set 
of genes does not indicate which variants are associated 
with the disease, or even which genes harbour such vari- 
ants. Yet discovery of associated variants and genes is a 
primary motivation for GWAS, and an important step to- 
ward understanding the biology of disease and, ultimately, 
developing effective medical therapies. In principle, iden- 
tifying enriched pathways should help us locate variants 
associated with disease, because knowing that a variant 
lies in or near a gene in an enriched pathway makes it a 
better disease-association candidate. Therefore, prioritiz- 
ing these variants — say, by slightly relaxing thresholds of 
significance — should yield a greater number of reproducible 
associations for a given rate of false positives. But it is un- 
clear how to implement this in practice: to what extent can 
we relax significance thresholds while keeping the rate of 
false positives at an acceptable level? 

To address this problem, we develop a model-based ap- 
proach for joint analysis of pathways and genetic variants, 
in which we interpret enrichment as a model parameter. 
The enrichment parameter quantifies the increase in the 
probability that each variant in the pathway is associated 
with disease risk. By jointly analyzing variants and path- 
ways, our method adjusts association evidence in light of 
estimated pathway enrichments — sometimes called pathway 
or gene prioritization [22l. l26U34l | — and, simultaneously, ad- 
justs enrichment estimates to reflect evidence of associa- 
tions in pathways. 

Our approach builds on statistical methods for simulta- 
neous mapping of genetic variants in GWAS [H|-|43| . In con- 
trast to single-marker regression approaches, these meth- 
ods model susceptibility to disease by the combined effect 
of multiple variants, and use sparse multivariate regression 
techniques to fit multi-marker (i.e. polygenic) models to the 
data. By adopting a multi-marker disease model, estimat- 
ing enrichment effectively reduces to counting, inside and 
outside each candidate pathway, variants associated with 
disease — more precisely, the variants that are included in 
the polygenic disease model. Our approach to combining 
multi-marker modeling with pathway analysis offers sev- 
eral benefits. First, compared to single-marker approaches, 
multi-marker modeling improves power to detect genetic as- 
sociations [13, Hi], H|| . Second, unlike many pathway anal- 
ysis methods that test for enrichment of significant SNPs 
or genes in a pathway [2(i| . we have no need to select a 
significance threshold for p-values; instead, we use the as- 
sociation signal across all genes to assess enrichment. Third, 
by analyzing variants simultaneously, we avoid exaggerat- 
ing evidence for enrichment from associated variants that 
are correlated with each other (i.e. in linkage disequilib- 
rium), while still allowing multiple independent association 
signals near a gene to contribute evidence for enrichment. 
And fourth, quantifying enrichment within this framework 
naturally gives us feedback about associations within en- 
riched pathways, potentially leading to discovery of novel 
genetic loci underlying disease. 

Though we focus on incorporating pathways (and, 
more broadly, biologically related gene sets) into analy- 



sis of GWAS, our methods also apply to other types of 
genome annotations. In this respect, our work is related to 
other model-based approaches for estimating enrichment of 
genome- wide association signals across functionally related 
genomic regions, like transcription factor binding sites [45l — 
|48I | . One distinguishing feature of our approach is that we 
have the ability to test for enrichment, which is important 
for assessing which candidate pathways show the strongest 
support for enrichment. We assess support for enrichment 
by framing it is as a model comparison problem. An advan- 
tage of this approach is that we can use the same approach 
to assess support for the enrichment of combinations of two 
or more pathways. This is useful, for example, if a pathway 
relevant to disease pathogenesis is ranked highly only after 
combining it with another pathway relevant to the disease. 

Another feature that distinguishes our analysis is that 
we use multiple pathway databases in an attempt to inter- 
rogate pathways as comprehensively as possible — the more 
pathways we consider, the greater chance we have of draw- 
ing new connections between pathways, genes within these 
pathways, and complex disease. We demonstrate how us- 
ing our approach to comprehensively interrogate pathways 
results in increased evidence for enrichment, and is robust 
to inclusion of a large number of irrelevant pathways. Our 
study includes ~3I00 candidate pathways drawn from eight 
well-developed pathway databases available on the Web 

We demonstrate our approach in a detailed analysis of 
a GWAS for Crohn's disease with about 440, 000 single nu- 
cleotide polymorphisms (SNPs) genotyped in roughly 1700 
cases and 3000 controls. This is a convenient study for 
gauging the benefits of our approach because genetic as- 
sociations have already been published based on data from 
this study [5l|, and pathway analyses of these data have 
found evidence for enriched pathways (171 |52 - |55| . Our en- 
richment results highlight the role of cytokines that mod- 
ulate immune responses in Crohn's disease, and the IL-12 
and IL-23 signaling pathways, which have been previously 
linked to the disease [1, HH, Hjf] • And, by prioritizing vari- 
ants within these enriched pathways, our method identifies 
disease-susceptibility candidates that are not deemed sig- 
nificant in conventional analyses of the same data, including 
the STAT3 gene, the IBD5 locus, and the major histocom- 
patibility complex (MHC) class II genes. All these genetic 
associations have been independently confirmed in other 
studies, demonstrating that our methods have the poten- 
tial to yield novel biological insights. 

Overview of statistical analysis 

Our approach builds on previous work that casts simul- 
taneous analysis of genetic variants as a variable selection 
problem — the problem of deciding which variables (the ge- 
netic variants) to include in a multivariate regression of 
the phenotype. We begin with a method that assumes each 
variant is equally likely to be associated with the pheno- 
type [36l [37J , then we modify this assumption to allow for 
enrichment of associated variants in a pathway. 

The data from the GWAS are the genotypes X = 
(xi, . . . ,x n ) T and phenotypes y = (yi,...,y n ) T from n 
study participants. Here we assume the genetic markers 
are SNPs, and the phenotype is disease status: patients with 
the disease ("cases") are labeled y< = 1, and disease- free in- 
dividuals ("controls") are labeled = 0. Entries of the nxp 
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matrix X are observed minor allele counts Xij € {0, 1,2}, 
or expectations of these counts estimated using genotype 
imputation [13, HI], for each of the n samples and p SNPs. 

We assume an additive model of disease risk, in which 
the log-odds for disease is a linear combination of the minor 
allele counts: 

log ( P [ yi = I] } = A) + xnPx + ■■■+ x ip p p . (1) 



.P(Vi = 0). 

Under this additive model, is the odds ratio, the mul- 
tiplicative increase in odds of disease for each copy of the 
minor allele at locus j. We do not consider dominant or 
recessive effects on disease risk, but it would be straightfor- 
ward to include them; see (59|. This method is also easily 
adapted to quantitative traits by replacing ([T]) with a linear 
regression for y. 

Although the log-odds for disease is expressed in (fTJ as 
a linear combination of all SNPs, we assume most SNPs j 
have no effect on disease risk (j3j = 0). We refer to SNPs j 
that have non-zero coefficients ($j ^ 0) as being "included" 
in the multi- marker disease model. A SNP's inclusion sig- 
nals that it affects susceptibility to disease, or that is in 
linkage disequilibrium with other, possibly untyped, risk- 
conferring variants. Therefore, the main goal of the analy- 
sis is to compute, for each SNP j, the posterior inclusion 
probability, PIP(j) = p(/3j ^ | X,y). A high posterior in- 
clusion probability is the analogue of a small p-value in a 
conventional single-marker analysis. 

To obtain these posterior inclusion probabilities, we 
must first specify a prior. A standard assumption, and the 
assumption made in previous approaches [36l . l37j , is that 
SNPs are equally likely to be associated with the pheno- 
type a priori; that is, TVj = p(/3j 0) is the same for all 
SNPs. 

To model enrichment of associations within a pathway, 
we modify this prior. Precisely, the prior inclusion proba- 
bility for SNP j depends on whether or not it is assigned 
to the enriched pathway: 



logi 



1 



(2) 



The pathway indicators aj keep track of which SNPs are 
assigned to the enriched pathway: aj = 1 when SNP j is as- 
signed to the enriched pathway, otherwise aj =0. (In brief, 
a SNP is assigned to a pathway if it is near a gene in the 
pathway; see Methods.) We refer to 9q as the genome-wide 
log-odds, since it reflects the overall proportion of SNPs 
that are included in the multi-marker disease model. (More 
precisely, it is the proportion for SNPs not assigned to the 
pathway, but this is usually most SNPs.) We refer to 9 as 
the enrichment parameter because it corresponds to the in- 
crease in probability (on the log-odds scale) that a SNP 
assigned to the pathway is included in the model. For ex- 
ample, 6>o = — 5 and 6 = 2 indicates that 1 out of every 
10,000 SNPs outside the pathway is included in the multi- 
marker model, but for SNPs assigned to the pathway, 1 
out of every 100 are included. If 9 = 0, this reduces to the 
standard prior assumption made by previous methods. We 
expect 6 to be zero, or close to zero, for most pathways. 

To assess evidence for enrichment of a candidate path- 
way with indicators a = (ox, ■ • ■ , a p ), we compute a Bayes 
factor Hill: 

p(y\X,a,9>0) 

BF(fl) = P (y\x,e = o) ' (3) 



This Bayes factor (BF) is the ratio of likelihoods under 
two models, the model in which the candidate pathway is 
enriched for SNPs included in the multi-marker regression 
(9 > 0), and the null model that no pathways are enriched 
(9 = 0). A larger BF implies stronger evidence for enrich- 
ment. We compute each BF by averaging, or integrating, 
over the unknown parameters, and over multi-marker mod- 
els with different combinations of SNPs, using appropriate 
prior distributions for 9o and 9 (see Methods). 

We use the same approach to test for joint enrichment 
of multiple candidate pathways. We compute BF(a) as be- 
fore (eq. [3]), except that we set aj to 1 whenever SNP j 
is assigned to at least one of the pathways. In this case, 9 
represents the increased rate of associations among SNPs 
assigned to one or more of the pathways. This is equivalent 
to assuming that all enriched pathways have the same level 
of enrichment. It would be possible to relax this assump- 
tion, but at the cost of complicating the analysis, so we 
restrict ourselves to a single enrichment parameter. 

To assess evidence for association of individual variants 
with the phenotype, we compute PIP(j) for each variant 
j. These posterior probabilities depend on which pathways 
are enriched, and on the strength of enrichment 9, because 
these factors affect the prior probabilities itj , which in turn 
affect the posterior probabilities PIP(j) following Bayes' 
rule. (In practice, we account for uncertainty in 9$ and 
9 when calculating the posterior probabilities by averag- 
ing over 9q and 9; see Methods.) Since enrichment leads 
to higher prior inclusion probabilities for SNPs in the en- 
riched pathway, an association that is not identified by a 
conventional genome- wide analysis, perhaps because the al- 
lele appears infrequently in the population, or because the 
nearby functional polymorphism has only a small effect on 
disease risk, may become a strong candidate in light of its 
presence in an enriched pathway. Because we estimate 9 
from the data, the extent to which we prioritize variants 
in enriched pathways is determined by the data. In this 
way, our framework integrates the problem of identifying 
enriched pathways with the problem of prioritizing variants 
within enriched pathways. 



Results 

We illustrate our methods through an extended example — 
analysis of genome-wide marker data from the WTCCC 
Crohn's disease study [5l| . After steps to ensure data qual- 
ity, the data consist of -440, 000 SNPs genotyped for 1748 
cases and 2938 controls. (See Methods for details.) Crohn's 
disease is well suited to illustrating the benefits of path- 
way analysis because many pathways related to immune 
function and inflammatory response have been character- 
ized. Additionally, genetic associations have been published 
based on data from this study [5l| . and have been repli- 
cated in a follow-up study (62|. Beyond these association 
analyses, enrichment analyses of these data have provided 
further support for links between Crohn's disease and path- 
ways related to adaptive and innate immunity (l7L [52h55| . 

We have three main goals in presenting this case study: 
first, to explain how to use and interpret the BFs, PIPs, 
and other statistics relevant to joint analysis of variants 
and pathways; second, to highlight the features, and limita- 
tions, of our approach; and third, to examine the hypothesis 
that we can gain additional insights into Crohn's disease by 
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reassessing the evidence for associations between variants 
and disease in light of pathway enrichment findings. 

Our analysis proceeds in three stages. First, we compute 
a BF for each candidate pathway, and use these Bayes fac- 
tors to rank the pathways. Second, based on this ranking, 
we assess evidence for models in which two or more path- 
ways are enriched. Third, we investigate whether the most 
compelling pathway enrichments can help us locate Crohn's 
disease associations beyond those identified in analyses that 
ignore information about pathways. 

Bayes factors for enriched pathways 

To rank the 3158 candidate pathways by their evidence for 
enrichment, we compute, for each pathway, a Bayes Factor 
that measures support for enrichment relative to the null 
hypothesis. All candidate pathways have been curated by 
domain experts, or are based on experimental evidence in 
other organisms and inferred via gene homology. To be as 
comprehensive as possible, we draw pathways from a va- 
riety of publicly accessible collections, and we include all 
pathways, even those that are unlikely to be relevant to 
Crohn's disease based on current understanding of disease 
pathogenesis. To help make our results replicable, we doc- 
ument the pathway databases used, and the steps taken to 
compile gene sets from these pathway data. (See Table [4] 
for the list of pathway databases used; see Supplementary 
Results for statistics on gene sets and gene coverage from 
these databases; see Methods for retrieval and processing 
of pathway data, and assignment of SNPs to pathways.) 

Many of the pathways in the databases are arranged 
hierarchically — we include all elements of the hierarchy in 
our analysis. Elements in upper levels of the hierarchy refer 
to groups of pathways with shared attributes or a common 
function. Some groups have a broad definition, such as "im- 
mune system" in Reactome, which includes pathways in- 
volved in adaptive and innate immune response. (Hereafter, 
we use the term "pathway" to refer to a set of biologically 
related genes.) Enrichment of a broad group of genes is 
unlikely to provide novel insights into disease pathogene- 
sis. However, a key step in our analysis is to re-interrogate 
SNPs for association in light of inferred enrichments. Thus, 
enrichment of a broad physiological target like "immune 
system" can be useful if subsequent re-interrogation reveals 
associations that were not significant in a conventional anal- 
ysis. 

We find that the vast majority of pathways show lit- 
tle or no evidence for enrichment; of the 3518 candidate 
pathways, 2850 (90%) have BF < 1, and an additional 233 
pathways (7%) have BFs between 1 and 10. Table [1] shows 
the 18 pathways with BF > 100. The cutoff at 100, as with 
any cutoff, is somewhat arbitrary. We discuss this issue and 
interpretation of the BFs below. 

Several gene sets in Table Q] are subsets of one another 
(refer to Fig. IB. II for relationships among these pathways 
in the Reactome and PID hierarchies). For example, "im- 
mune system" overlaps with eight other pathways in the 
table, including cytokine signaling. Several pathways ap- 
pear in the table twice because the NCBI BioSystcms and 
Pathway Commons databases sometimes disagree about 
the set of genes assigned to a pathway. These discrepancies 
can have a substantial impact on the results. For example, 
the BF for the Pathway Commons version of cytokine sig- 
naling is smaller than the BioSystcms version by a factor of 



roughly 1000, due primarily to the lack of inclusion of NOD2 
and MHC genes that contribute to the association signal. 
Conversely, the BF for the Pathway Commons version of IL- 
23 signaling is about 80 times larger than the BioSystems 
version because the former includes the NF-kB pathway, 
and this pathway contains several genes that contribute 
to the associational signal, notably NOD2. (Inclusion of 
the NF-kB pathway is supported by experimental evidence 
p7j.) These results illustrate the benefits of a comprehen- 
sive analysis that considers pathways from multiple sources. 
Also note that no pathways from HumanCyc and KEGG, 
which are mainly focused on metabolic pathways, show up 
in Table [T] (nor do any pathways from Cancer Cell Map, 
PANTHER and WikiPathways) . This points to the robust- 
ness of our approach to inclusion of a large number of ir- 
relevant pathways. 

All the pathways in Table [1] are related in some way 
to immune system function. These pathways implicate 
key actors in responses to pro-inflammatory stimuli and 
in regulation of innate and adaptive immunity. This in- 
cludes members of the NF-kB/RcI family, T-cell receptors 
(TCRs) , members of the protein tyrosine phosphatase fam- 
ily (PTPs), mitogen-activated protein (MAP) kinases such 
as c-Jun NH2-terminal kinases (JNKs), and chemokine re- 
ceptors (cxcrs) glial] • 

In this initial ranking, three pathways stand out as hav- 
ing stronger evidence for enrichment than the others: cy- 
tokine signaling in immune system, IL23-mediated signal- 
ing events, and IL12-mediated signaling events. The "cy- 
tokine signaling in immune system" pathway, with BF = 
7.9 x 10 5 , is a collection of cytokine-driven networks that 
modulate immune responses. Cytokines, a class of chemical 
messengers that includes interferons (IFNs), interleukins 
(ILs) and tumor necrosis factors (TNFs), have well under- 
stood roles in immune processes. Accordingly, they exhibit 
a complex relationship to autoimmunity: in addition to pro- 
moting inflammatory and immune responses, they play an 
important role in suppressing immunity [72| . Accumulating 
evidence points to cytokines, and the signaling cascades 
initiated by these cytokines, in a range of autoimmune dis- 
orders, including inflammatory bowel disease H, [71, ITU ]. 
Enrichment of "cytokine signaling in immune system" is 
consistent with this view. Even though it implicates a broad 
class of genes (the BioSystems version has 225 genes), en- 
richment for Crohn's disease associations is biologically 
plausible; immune response involves a complex interplay 
among signaling pathways, so a collection of pathways may 
explain the pattern of genetic associations better than any 
single signal transduction pathway. 

Enrichment of the IL-23 pathway for Crohn's disease 
associations is consistent with its involvement in intesti- 
nal inflammation, specifically in mediating differentiation of 
CD4 + Thl7 cells [751 ]. Additional findings from mouse mod- 
els and genetic association studies support its role in inflam- 
matory bowel disease [1, [H, I75T477I ] . IL-12 is also thought 
to be important for immune response and differentiation of 
Thl7 cells, even if many of the regulating activities previ- 
ously ascribed to IL-12 are due to IL-23 instead [Tfl l75| . 
Although IL-12 and IL-23 have distinct functions in regu- 
lation of T helper cells, their pathways have many cytokine 
and cytokine receptor components in common [7a ]. so it 
may be difficult to tease apart their roles in disease solely 
by looking at enrichment in their gene sets; of the 66 genes 
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Table 1. Pathways, and groups of related pathways, that exhibit the strongest evidence for enrichment 
of Crohn's disease associations, as measured by their Bayes factors 







number of 


Bayes 










enriched pathway 


database 


genes/SNPs 


factor 




enrichment 


Cytokine signaling in immune system 


React. (BS) 


225/6711 


7.9 x 10 5 


—4.1 


1.96 


(1.25- 


-2.50) 
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IL12-mediated signaling events 


PID (PC) 


111/3641 


5.4 xlO 3 


-4.0 


1.90 


(1.25- 


-2.50) 


Immune system 


Keact. (Kb) 


7rr C\y~C\ 

755/20,959 


1.3 x 10 3 


—4.1 


1.44 


'0.75- 


-2.00) 


Signaling by interleukins 


React. (BS) 


111/3678 


769 


-3.9 


1.70 


'1.25- 


-2.25) 


Interferon gamma signaling 


React. (BS) 


73/2290 


700 


—3.8 


1.73 


'1.25- 


-2.25) 


Immune system 


React. (PC) 


529/15,074 


692 


—4.1 


1.49 


'0.75- 


-2.00) 


Interferon signaling 


React. (BS) 


111/2965 


491 


-3.8 


1.68 


(1.25- 


-2.25) 


Cytokine signaling in immune system 


React. (PC) 


193/6194 


292 


—3.9 


1.55 


(1.00- 


-2.25) 


Signaling events mediated by TC-PTP 


PID (PC) 


92/3044 


219 


-3.9 


1.62 


'1.00- 


-2.25) 


TAK1 activates NF-kB by phosphory- 
lation and activation of IKKs complex 


React. (BS) 


24/844 


181 


-3.8 


2.00 


(1.25- 


-2.50) 


Selective expression of chemokine 


BioCarta 


29/880 


139 


-3.8 


2.01 


(1.25- 


-2.75) 


receptors during T-cell polarization 




IL27-mediated signaling events 


PID (PC, BS) 


26/796 


130 


-3.8 


2.00 


(1.25- 


-2.50) 


IL23-mediated signaling events 


PID (BS) 


37/1252 


119 


-3.8 


1.89 


(1.25- 


-2.50) 


CXCR4-mediated signaling events 


PID (PC) 


190/6733 


118 


-3.9 


1.43 


(0.75- 


-2.00) 


Activated TAK1 mediates p38 MAPK 
activation 


React. (BS) 


17/535 


116 


-3.8 


2.07 


(1.25- 


-2.75) 


TCR signaling in naive CD4+ T cells 


PID (PC) 


133/4809 


103 


-3.9 


1.47 


(0.75- 


-2.00) 


JNK phosphorylation and activation 
mediated by activated human TAK1 


React. (BS) 


16/559 


102 


-3.8 


2.03 


(1.25- 


-2.75) 



The table includes all gene sets with BF > 100. Abbreviations used in this table: React. = Reactome [63||. PID = NCI Nature 
Pathway Interaction Database [64fl. BS = NCBI BioSystems [65|], PC = Pathway Commons [6(|; 80 = posterior mean of 
genome-wide log-odds 9o given that pathway is enriched; 9 = posterior mean of enrichment 9 (and 95% credible interval) given 
that pathway is enriched. The credible interval is the smallest interval about the posterior mean that contains 9 with 95% 
posterior probability. It is calculated to the nearest 0.25 using a numerical approximation (see Supplementary Methods). Note 
that these numbers may not be reproduced exactly in an independent analysis using the same method, due to stochasticity in our 
approximate computations. However, only slight deviations from these numbers are expected. 



assigned to the IL-23 pathway, 54 are assigned to the IL-12 
pathway (Pathway Commons version). 

Previous findings from genome-wide association stud- 
ies have linked autophagy genes ATG16L1 and IRGM to 
Crohn's disease [HI, [79|, |8(| . Our pathway analysis does not 
provide additional support for autophagy in Crohn's disease 
because pathways reflecting current models of autophagy 
H, HH have not yet been incorporated, to our knowledge, 
into any of the publicly available pathway databases. 

Many of the pathways listed in Table [T] are related to 
those identified in previous pathway analyses of Crohn's 
disease [13, H3413l, including Jak-STAT signaling [|p and 
T cell receptor signaling [53. Other pathways highlighted 
in previous analyses show some evidence for enrichment in 
our analysis, but these arc eclipsed by much stronger enrich- 
ment signals (Table Ql. For example, Wang et al [55| obtain 
the most evidence for enrichment of "IL12 and Stat4 depen- 
dent signaling in Thl development" (p-value = 8 x 10~ 5 , 
FDR = 0.045) based on enrichment analysis of BioCarta, 
KEGG and Gene Ontology [82| gene sets, whereas this 
pathway showed only modest evidence for enrichment in our 
analysis (BF = 18) compared to the pathways in Tabled] 
(Below, we obtain greater support for enrichment of this 
pathway when combined with cytokine signaling genes.) 
Moveover, all except one of the pathways in Table Q] are 
from databases that were not included in (55|- These re- 
sults illustrate the benefits of a comprehensive search for 
enrichment across multiple pathway databases. 



Given that enrichment analyses typically proceed by 
computing p-values and assessing "significance," one may 
reasonably ask whether the BFs in Table [1] represent "sig- 
nificant" evidence for enrichment. Specifying an appropri- 
ate threshold for a BF to be considered significant, how- 
ever, is context-dependent, and subjective. This is because 
the posterior odds for a pathway being enriched, relative to 
the null hypothesis that no pathways are enriched, is equal 
to the Bayes Factor times the prior odds for enrichment, 
and the prior odds for each pathway depends on how plau- 
sible it is, a priori, that the pathway is involved in Crohn's 
disease pathogenesis. (Similar issues arise when specifying 
significance thresholds for p-values. For example, the false 
discovery rate at a given p-value threshold depends on the 
prior probability of enrichment [H, H3|- In practice, how- 
ever, significance thresholds of 0.05 or 0.01 are often used 
without attending to this concern.) Nonetheless, we can 
make the following observations. First, if we are willing to 
assume the pathways in Table [T] are all equally plausible 
candidates for enrichment a priori, then the ratio of the 
BFs indicates the relative support for the enrichment hy- 
potheses. For example, if we are forced to choose between 
enrichment of "cytokine signaling in immune system" and 
"IL23-mcdiated signaling events," the data overwhelmingly 
favour the former by a factor of g^io 3 ~ 87- Second, even 
under a "conservative" prior for enrichment in which we 
expect one pathway to be enriched among the 3158 can- 
didates, corresponding to a prior odds of 1/3158, the top 
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Table 2. Pairs of pathways with strongest evidence for joint enrichment of Crohn's disease associations 



enriched pathways 



number of Bayes 
database genes/SNPs factor 







cytokine signaling in immune system and . . . 
IL23-mediated signaling events 
Selective expression of chemokine receptors during 
T-cell polarization 
Thl/Th2 differentiation 
N02-dependent IL-12 pathway in NK cells 
IL27-mediated signaling events 
IL12-mediated signaling events 

IL-12 and Stat4 dependent signaling pathway in Thl 
development 

IL23-mediated signaling events 



PID (BS) 

BioCarta 

BioCarta 
BioCarta 
PID (PC, BS) 
PID (BS) 

BioCarta 

PID (PC) 



247/7438 

248/7375 

236/7090 
239/7143 
238/7073 
263/7807 

241/7272 

266/8060 



5.5 x10 s 
4.8 x10 s 

3.6 x10 s 
3.2 x10 s 
3.0 x10 s 
2.4 x10 s 

2.4 x10 s 

2.2 x10 s 



2.33 

2.33 

2.32 
2.31 
2.30 
2.30 

2.30 

2.29 



This table includes every model with two enriched pathways that has a BFs greater than 10 8 . All these combinations includes 
"cytokine signaling in immune system" (the Biosystems version with 225 genes). Refer to Table [1] for the legend. "Number of 
genes/SNPs" gives the total number assigned to the enriched pathways. For every enrichment hypothesis listed in the table, 
6o = 4.4, and the 95% credible interval for 6 to the nearest 0.25 is (1.75-2.75). Refer to Fig. IB.ll for relationships among 
pathways in the Reactome and PID hierarchies. 



three pathways have BFs that are large enough (greater 
than 3158) to support enrichment. 

Assessing combinations of pathways for enrichment 

The initial ranking (Tabic [1]) suggests that enrichment of 
a group of pathways, cytokine signaling in immune sys- 
tem, offers a better fit to the pattern of Crohn's disease 
associations than any one pathway. But the question re- 
mains whether some other combination of pathways offers 
a better fit. A benefit of our approach is that we can di- 
rectly compare support for enrichment of different com- 
binations of pathways by comparing their BFs (assuming 
the same prior for these hypotheses). This is because the 
ratio BF(a)/BF(a*) is the same as the Bayes factor that 
compares support for the enrichment model encoded by a 
versus the model encoded by a*. (By contrast, it is harder 
to make such comparisons using p- values. For example, if p 
is the p- value for testing hypothesis a against the null, and 
p* is the p-vahie for testing a* against the null, it is not 
clear how to compare support for a and a*.) 

We begin by computing BFs to quantify support for 
enrichment of pairs of pathways. Since it is impractical to 
consider all pairs, we tackle this in a "greedy" fashion by se- 
lecting combinations of pathways based on the initial rank- 
ing. Our strategy is to select pathways with the largest BFs 
(here we take IL-23, IL-12 and cytokine signaling), and we 
consider each of these in combination with pathways from 
a larger set of candidates (here we use the 72 pathways 
with BF > 10). This greedy heuristic makes it feasible to 
evaluate many combinations of pathways that could plau- 
sibly be jointly enriched, though since it does not consider 
all combinations, there is the risk of overlooking a combina- 
tion that provides a better fit to the pattern of associations. 
In total, we compute BFs for 219 pairs of pathways, which 
includes 3 combinations of IL-23, IL-12 and cytokine sig- 
naling, and 216 = 3 x 72 combinations of IL-23, IL-12 or 
cytokine signaling with another pathway 

Tablc[2]lists all models with two enriched pathways that 
have BF > 10 s . As before, all these pathways are related 
to innate and adaptive immune processes. Three of the 



BioCarta pathways appearing in this table, including the 
IL-12 and Stat4 dependent si gna ling pathway highlighted 
in the enrichment analysis of |55j ). did not originally show 
up in Table [T] because their BFs were less than 100. The 
largest BF, for enrichment of IL-23 and cytokine signal- 
ing, is unsurprising in light of the initial ranking, except 
that this version of the IL-23 pathway does not include the 
NF-kB pathway, which suggests that the NF-kB pathway 
does not contribute additional evidence for enrichment once 
we account for enrichment of cytokine signaling genes (14 
of the 35 genes in the NF-kB pathway, including NOD2, 
are also members of cytokine signaling). Among hypothe- 
ses that do not involve cytokine signaling, the largest BF is 
5.1 x 10 7 , corresponding to a model in which IL-23 signaling 
and interferon gamma signaling are enriched. 

The top BF in Table d is ~700 times greater than the 
largest BF in Table [1] This suggests that the best model 
with two enriched pathways provides a much better fit to 
the data than the best model with any one enriched path- 
way. However, to properly interpret this result we must 
weigh this increase in the BF against the relative prior 
plausibility of the models, which is difficult to quantify. A 
naive argument using a "conservative" prior for any pair 
of pathways being enriched might suggest a prior odds of 
(1/3158) 2 , based on the conservative prior for a single path- 
way we discussed above. This prior would make a 700-fold 
increase in the BF appear to be relatively insignificant. 
However, this argument not only depends on the earlier 
prior, which may be overly conservative, but also assumes 
independence of enriched pathways, which seems unwise 
considering that many pathways mentioned here have re- 
lated roles in immunity; a priori, one might expect that a 
pathway is more likely to be enriched when a biologically 
related pathway is enriched. With this in mind, we inter- 
pret Tablc[2]as providing substantial, if short of compelling, 
support for the hypothesis that two pathways are enriched 
for Crohn's disease associations. Perhaps a more important 
question is whether these findings lead to identification of 
additional loci affecting susceptibility to Crohn's disease, a 
question we address in the next section. 

For completeness, we extend the analysis to models with 
three enriched pathways. Again, following a greedy straf- 
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Fig. 1. Scatterplots showing Pi, the posterior probability that each genomic segment contains Crohn's 
disease associations, given different hypotheses about enriched pathways. Each point corresponds to a 
segment of the genome containing 50 SNPs. Some of the segments are labeled by representative candidate genes. In 
panel A, genes assigned to "cytokine signaling in immune system" are written in bold. 



egy, we take all pairs of pathways with BF > 10 s (Table [2]) 
and combine these pairs with individual pathways that have 
BF > 100 (Table [J). Of the 126 resulting BFs, the largest 
is 8.4 x 10 9 , corresponding to enrichment of IL-23 signaling 
(Pathway Commons), cytokine signaling (BioSystems) and 
"TAK1 activates NF-kB by phosphorylation and activation 
of IKKs." This BF is only about 3 times greater than the 
largest BF in Table [2] Following our earlier arguments, this 
result does not constitute strong support for enrichment of 
three pathways. 

Associations informed by enriched pathways 

Now we examine how the pathway enrichment findings can 
help us identify additional genetic associations. The intu- 
ition is that, having established that variants near genes in 
a pathway are more likely to be associated with the phc- 
notypc, it is reasonable to up-weight, or prioritize, these 
SNPs in the statistical analysis. Our model-based frame- 
work achieves this by estimating an enrichment parameter 
representing the increased prior probability that SNPs as- 
signed to the pathway are associated with the phenotypc. 
This prior probability in turn raises the posterior probabil- 
ity of association for these SNPs. This ability to reassess 
the evidence for association of individual loci in light of 
enriched pathways is an important feature of our model- 
based approach to pathway analysis, as identifying individ- 
ual disease-susceptibility loci is potentially more informa- 
tive than observing that a pathway is enriched for disease 
associations. This is particularly the case for broad path- 
ways, such as cytokine signaling in immune system, which 
contains 225 genes, as only a small proportion of these genes 
may actually harbour genetic variants that affect Crohn's 
disease risk. 

In simultaneous analysis of genetic variants based on 
Baycsian variable selection, it is preferable, at least ini- 
tially, to assess evidence for associations across genomic 



regions, rather than for individual SNPs. This is because 
when SNPs are correlated with one another the association 
signal can be spread across these SNPs, diluting the signal 
at any given SNP [32} • Therefore, we divide the genome 
into overlapping segments of 50 SNPs, with an overlap of 
25 SNPs between neighbouring segments. For each segment, 
we compute Pi , the posterior probability that at least one 
SNP in the segment is included in the multi-marker disease 
model. We use segments with an equal number of SNPs so 
that, under the null hypothesis of no enrichment, the prior 
probability that at least one SNP is included is the same 
for every segment. Since a segment spans, on average, 307 
kb of the genome (98% of segments are between 100 kb and 
1 Mb long), calculating Pi for these segments provides only 
a low-resolution map of genetic risk factors for Crohn's dis- 
ease. Still, this resolution suffices for the objectives of this 
case study. In other applications, one could increase the 
resolution by calculating PIPs for individual SNPs within 
selected regions. Hereafter, we use P„ to denote the poste- 
rior probability that at least n SNPs arc included. 

First, we compare Crohn's disease associations identi- 
fied under the null hypothesis that no pathways are en- 
riched with associations identified under the model in which 
a single pathway, cytokine signaling in immune system, is 
enriched. Figure [T]A. shows how the evidence for association 
in each segment (Pi) changes once we account for enrich- 
ment of cytokine signaling genes. As expected, many seg- 
ments (corresponding to points above the diagonal in the 
scattcrplot) show increased evidence for association under 
the model in which cytokine signaling is enriched. These 
segments contain SNPs assigned to the enriched pathway. 
Segments corresponding to points on the diagonal show no 
change in support for association, and these are segments 
that do not contain SNPs assigned to the enriched pathway. 
Although it is not clear from the figure due to over-plotting, 
most segments lie near the bottom-left corner; when cy- 
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Table 3. Selected regions of the genome with strong evidence for Crohn's disease risk factors given that 
two pathways are enriched for Crohn's disease associations 





critical 




1 i 




L 2 


candidate 




MAF 


chr. 


region (Mb) 


null 


alt. 


null 


alt. 


gene(s) 


SNP 


ctrls 


cases 


Ip31 


67.30-67.48 


1.00 


1.00 


0.06 


0.48 


IL23R 


rsll805303 


0.318 


0.392 


2q37 


233.92-234.27 


1.00 


1.00 


0.01 


0.22 


ATG16L1 


rsl0210302 


0.481 


0.402 


5pl3 


40.32-40.66 


1.00 


1.00 


0.42 


0.42 


PTGER4 


rsl7234657 


0.124 


0.181 


* 5q23 


129.54-132.04 


0.18 


0.86 


0.02 


0.29 


multiple (IBD5) 


rs274552 


0.166 


0.128 


* 6p21 


32.3-32.92 


0.49 


0.98 


0.02 


0.31 


MHC class II 


rs9469220 


0.519 


0.466 


10q21 


64.0-64.43 


0.96 


0.96 


0.03 


0.03 


ZNF365 


rsl0995271 


0.386 


0.440 


10q24 


101.26-101.32 


0.95 


0.95 


0.01 


0.04 


NKX2-3 


rs7095491 


0.470 


0.528 


16ql2 


49.0-49.4 


1.00 


1.00 


0.11 


0.77 


NOD2 


rsl7221417 


0.287 


0.356 


* 17q21 


37.5-38.3 


0.12 


0.87 


0.01 


0.21 


STAT3 


rs744166 


0.439 


0.392 


18pll 


12.76-12.91 


0.94 


1.00 


0.01 


0.20 


PTPN2 


rs2542151 


0.163 


0.209 



For every region in this table, there is at least a 0.8 probability that one or more SNPs in the region are included in the 
multi-marker disease model (Pi > 0.8) given the hypothesis that two pathways are enriched. Rows marked with an asterisk (*) 
are selected only after accounting for enriched pathways. Table columns from left to right are: (1) chromosomal locus; (2) region 
most likely containing the risk-conferring variant (s), in Megabases (Mb); (3) posterior probability that one or more SNPs in the 
region are included in the model under the null hypothesis, and (4) under the alternative hypothesis that two pathways are 
enriched; (5) posterior probability that two or more SNPs are included under the null, and (6) under the alternative; (7) 
established genes in Crohn's disease pathogenesis, or most credible genes of interest, corresponding to the locus; (8) refSNP 
identifier of SNP in critical region with the largest PIP; (9) frequency of minor allele for that SNP in cases, and (10) in controls. 
The "critical region" at each locus is estimated by inspecting single-SNP BFs and bounding the region by areas of high 
recombination rate, inferred using data from Phase I, release 16a of the HapMap study [85|], and visualized in the UCSC Genome 
Browser [86|. All SNP information and genomic positions are based on human genome assembly 17 (NCBI build 35). 



tokine signaling is enriched, 17,261 out of 17,668 segments 
across the genome (97.7%) have P\ < 0.1. 

Points in the top-right corner of Fig. [T]A. correspond to 
regions with strong evidence for association even without 
the benefit of feedback from pathway enrichment. Genes 
IL23R, PTGER4, ZNF365, NKX2-3 and ATG16L1 are not 
involved in cytokine signaling, nor are any nearby genes, 
so the PIPs of SNPs near these genes are unaffected by 
the hypothesis that cytokine signaling is enriched. NOD2 
(also known as CARD15) and PTPN2 are cytokine signaling 
genes, so these associations contribute to the evidence for 
enrichment of this pathway, but because they already show 
strong support for association without enrichment (Pi is 
close to 1 under the null hypothesis), these associations 
are not greatly affected by enrichment. Reassuringly, these 
results recapitulate the stron gest associations reported in 
the original study (Table 3 in [51|) with trend p- values less 
than 4 x 10~ 8 , or those with additive BFs greater than 10 5 ' 4 . 
These results have also been replicated in a follow-up study 
(62[, and have been confirmed in meta-analyses with large 
combined sample sizes [6, 7]. Two additional regions at 3p21 
and 5q33 are reported as associations in the original study 
(5l| . although with trend-test p- values exceeding 5 x 10 -8 . 
These Crohn's disease associations are replicated elsewhere 
H, 0, H3 , but show only modest evidence for association in 
our analysis; these regions are not annotated to the cytokine 
signaling pathway, and the largest Pi for segments at these 
regions are 0.19 and 0.21, respectively. 

Points near the top-left corner of Fig. [TJV correspond to 
regions of the genome that show strong support for associ- 
ation only after accounting for enrichment of cytokine sig- 
naling. Three additional regions stand out: the MHC class 
II region, previously identified as a region showing mod- 
erate evidence of association [51| : the IBD5 locus at 5q31, 
which contains several candidate genes; and a region at po- 
sition 17q21 near gene STAT3. Other genome-wide studies 



and meta-analyses independently sup port Crohn's disease 
associations at these loci [q. 171 |87h91| . (See Supplementary 
Results for further details on these loci.) In addition, two 
loci at 16q24 and 10q22 near genes IRF8 and CAMK2G show 
moderate support for association under the enrichment hy- 
pothesis; Pi is 0.49 and 0.46, respectively. Neither of these 
loci have been identified as being significant associations in 
other Crohn's disease studies. However, the association at 
locus 16q24, 84.45-84.6 Mb near IRF8 is potentially inter- 
esting, as it has previously been identified in genome- wide 
studies of two other autoimmune diseases, multiple sclerosis 
[92| and systemic sclerosis [93| . This gene belongs to a fam- 
ily of transcription factors that regulate responses to type 
I interferons (IFN-a and IFN-/3), and these interferons are 
known play critical roles in modulating inflammatory and 
immune responses to pathogens [72j. 

Next we investigate whether allowing for two enriched 
pathways reveals any further associations. (See Methods 
for how Pi is computed by averaging over different models 
with two enriched pathways.) Figure [TJ3 shows that allow- 
ing for enrichment of two pathways does not yield com- 
pelling support for genetic associations beyond those re- 
vealed by enrichment of cytokine signaling. However, the 
segment with the greatest increase in Pi, from 0.04 to 0.53, 
is at 158.4-159.1 Mb on chromosome 5, near gene IL12B. 
This locus was reported as a Crohn's disease association in 
Jf|, and was later confirmed in 0. IL12B is also associated 
with other autoimmune diseases, including ulcerative coli- 
tis [sl, |9f| and psoriasis [96| . Note that all pathways listed 
in Table [5] except cytokine signaling implicate IL12B, so the 
association signal at this locus presumably contributes to 
evidence for enrichment of these pathways. 

Table [3] summarizes the Crohn's disease associations 
that are strongly supported by our analysis. Of the 10 re- 
gions listed in this table, 3 are revealed only after prior- 
itizing SNPs in enriched pathways. Each row in the ta- 
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ble shows the SNP within the critical region that has the 
largest PIP. In most cases, since only a portion of all com- 
monly occurring SNPs are included in the study, this SNP is 
most likely in linkage disequilibrium with the causal variant 
rather than being causal itself. We also show in this table, 
for each selected region, evidence for multiple independent 
risk factors, indicated by P2, the posterior probability that 
at least two SNPs in the region arc independently asso- 
ciated with Crohn's disease. Our calculations suggest the 
strong possibility of multiple independent risk factors at 
the NOD2 locus, with P2 = 0.77, a prediction that coin- 
cides with a previous study [97| . 

The large number of points approaching the middle of 
the y-axis in Fig.[T]A. suggests that many other gene variants 
involved in cytokine signaling may contribute to Crohn's 
disease risk. In fact, the estimates 9o = —4.11 and 9 = 1.96 
given in Tablc[T]imply that roughly 1 out of every 140 SNPs 
in the pathway are expected to be independent associations, 
for a total of about 47 independent risk factors for Crohn's 
disease hidden among "cytokine signaling in immune sys- 
tem" genes. Our analysis has only identified a few of these 
genes, which suggests that many more associations in this 
pathway remain to be discovered. This prediction coincides 
to some extent with a recent meta-analysis of Crohn's dis- 
ease association studies , as an additional 7 cytokine sig- 
naling genes— IL1R1, IP6K2, JAK2, IL2RA, TYK2, MAPK1 
and MAP3K71P1 — overlap with the susceptibility loci iden- 
tified in this meta-analysis. 



Sensitivity of pathway ranking to prior distribution of odds 
ratios 



Our approach requires specification of a prior distribution 
for the odds ratios (see Methods). We assume a prior in 
which the log-odds ratios (i.e. the coefficients f3j in the ad- 
ditive model of disease risk) are normally distributed with 
mean zero and standard deviation cr a = 0.1. This choice is 
based on odds ratios reported in published genome- wide as- 
sociation studies (see Methods). One concern is that slightly 
smaller or slightly larger settings of a a could also be jus- 
tified, and these choices could produce different results. 
Associations are unlikely to accumulate at a greater rate 
in pathways that are not related to the disease, even as- 
sociations with small effects on disease risk, so we predict 
that the ranking of pathway enrichments is largely robust 
to the choice of a a - Here we verify this claim. We assess 
the sensitivity of our results to a a by recomputing the BFs 
for all candidate pathways with prior choices that favor 
slightly smaller (a a = 0.06, 0.08) and slightly larger coeffi- 
cients (cr a = 0.15,0.2). 

Fig. [2] shows that smaller settings of a a yield substan- 
tially more support for enrichment of disease-related path- 
ways, as expected. But the pathways with the largest BFs 
are IL-23, IL-12 and cytokine signaling regardless of the 
choice of a a . In the Supplementary Results, we show that 
the BFs for most of the other 3158 candidate pathways do 
not change substantially at different settings of <j a . In sum- 
mary, we conclude that the pathway-level findings in this 
Crohn's disease study are largely robust to priors that are 
not substantially different from a a — 0.1. 
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Fig. 2. The top four BFs for each setting of cr a . In 

each case, the three largest BFs correspond, in order, to 
cytokine signaling in immune system, IL23-mcdiatcd 
signaling events, and IL12-mediated signaling events 
(these are also the top three pathways in Table [1]). The 
pathway with the fourth-largest BF differs across settings 
of a a . 



Discussion 

Pathway analysis for genome-wide association studies has 
been as advertised as a way to overcome some of the lim- 
itations of conventional approaches to identifying genetic 
factors underlying polygenic traits. Motivated by the obser- 
vation that it is easier, in principle, to identify associations 
within an enriched pathway, we developed a model-based 
approach for simultaneously estimating enrichment and pri- 
oritizing variants in enriched pathways. We investigated the 
merits and limitations of this approach in a GWAS for 
Crohn's disease. In this case study, we interrogated over 
3,000 candidate pathways from several pathway databases, 
and confirmed the importance of the IL-12 and IL-23 path- 
ways in Crohn's disease pathogenesis and, more broadly, 
the role of cytokines that mediate immune responses. By 
prioritizing variants within the enriched pathways identi- 
fied in our analysis, we established strong support for dis- 
ease susceptibility loci beyond what are revealed by a con- 
ventional analysis that ignores this pathway information. 
This suggests that applying our methods to larger samples 
may reveal genetic loci that have not yet been identified 
as risk factors for Crohn's disease and other common dis- 
eases. Moreover, leveraging knowledge about variants that 
modulate expression of genes (eQTLs) may also lead to 
stronger evidence for pathway enrichments, and for associ- 
ations within these pathways. 

Our approach to modeling enrichment was built on 
large-scale sparse regression methods that have been ap- 
plied to other problems in statistics and genetics. The key 
idea behind our approach was to introduce a parameter, 
that quantifies enrichment — precisely, the increase in the 
probability that a SNP assigned to a pathway is included 
in a polygenic model of the phenotype. Given the generality 
of this approach, our method could be useful for problems 
outside genetic association studies. One caveat is that some 
of the approximations we used — approximations which help 
scale the computations to hundreds of thousands of SNPs 
and thousands of candidate pathways (see Supplementary 
Methods) — may not be appropriate in some settings. 

One attractive feature of our approach, which we illus- 
trated in the Crohn's disease case study, is that it can be 
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used to assess how well the data support enrichment of dif- 
ferent combinations of multiple pathways. Examining com- 
binations of pathways for enrichment may highlight inter- 
esting pathways that would not otherwise be highly ranked. 
An example of this was the "IL-12 and Stat4 dependent 
signaling" pathway, which showed little support for enrich- 
ment on its own (BF = 18), but became more interesting 
when considered jointly for enrichment with cytokine sig- 
naling genes (BF = 2.4x 10 8 ). 

In contrast to many pathway analysis methods, we mod- 
eled pathway enrichment at the level of variants, rather 
than genes. While there are arguments for both approaches, 
a feature of the variant-based approach is that, when there 
arc multiple independent association signals near a gene, 
all these signals contribute to the evidence for enrichment 
of pathways containing this gene. 

The Crohn's disease study was aimed at identifying 
common variants associated with disease, but there is grow- 
ing interest in using exome and whole-genome sequencing 
to investigate the cont ribut ion of rare variation to complex 
diseases and traits [08l - 4lOOj ] . The computational complexity 
of our method grows linearly with the number of SNPs, so it 
should scale well to the large amount of data generated by 
high-throughput sequencing. Since detecting a ssociation s 
with individual rare variants is a hard problem [lOll . Il02j , 
pathway-based analysis approaches such as ours that ag- 
gregate association signals across sets of genes may play an 
important role in analysis of these studies. 

Currently, a drawback to our approach is that the prior 
variance of additive effects on disease risk must be chosen 
beforehand. We based our choice on the distribution of odds 
ratios reported in published genome- wide association stud- 
ies, and checked that the rankings of enriched pathways 
were robust to different prior choices. Ideally, we would es- 
timate this parameter from the data instead, but, as we 
discussed in Methods, we found that this did not work well 
for the Crohn's disease data. In our estimation, the root 
of the problem is that the log-odds ratios are not normally 
distributed (currently, we assume that non-zero effects fol- 
low a normal distribution) . One possible solution would be 
to use a more flexible distribution for the effect sizes, such 
as a mixture of two or more normals, but we have not in- 
vestigated this direction as it raises a number of questions 
regarding modeling and computation. Despite this issue, 
we believe that we have presented a useful framework for 
identifying enriched pathways and genetic associations un- 
derlying these pathways. 

Method 

Samples 

Our analysis is based on genome-wide marker data from a 
case-control study with 1748 subjects affected by Crohn's 
disease, and 2938 control subjects. The controls come from 
two groups: 1480 individuals from the 1958 Birth Cohort, 
and 1458 individuals from the UK Blood Services cohort. 
All subjects are from Great Britain, and of self-described 
European descent. Genetic associations from this study 
were first reported in [5l[ . 

All study subjects were genotyped for ~500, 000 SNPs 
using a commercial version of the Affymetrix GeneChip 
50 0K p latform. We apply quality control filters as described 
in [51| . and remove SNPs that exhibit no variation in the 



sample. We discarded two additional SNPs, rsl914328 on 
chromosome 8 and rs6601764 on chromosome 10, because 
they show some evidence for association (single-SNP BFs 
[59l | of 10 3 - 5 and 10 3 6 , respectively), but we cannot rule 
out the possibility of genotyping errors as these associa- 
tions are not supported by nearby SNPs (none of the nearby 
SNPs have single-SNP BFs greater than 40). After remov- 
ing these two SNPs, we end up with 442,001 SNPs on au- 
tosomal chromosomes. We estimate missing genotypes at 
these SNPs using t he m ean posterior minor allele count 
from BIMBAM jH, EH, with SNP data from Phase II of 
the International HapMap Consortium project j!04j . To be 
consistent with the original analysis, refSNP identifiers and 
locations of SNPs are based on human genome reference 
assembly 17 (NCBI build 35). 

Population stratification 

Analysis of pathways should, in principle, be robust to 
population stratification because spurious associations that 
arise from population structure are unlikely to accumulate 
at a greater rate in the pathway — recall, we define the en- 
richment parameter as the increase in proportion of vari- 
ants associated with disease risk relative to the proportion 
genome- wide. However, population stratification could still 
produce individual false positive associations, either inside 
or outside enriched pathways, so in gener al one should ac- 
count for this in the analysis I3l.l4l. ll05[4l07j |. For the Crohn's 
disease stud y, the original report |5l| and subsequent anal- 
yses d, Il08l | affirm that cryptic population structure does 
not have a substantive impact on the analysis. Thus we 
do not attempt to correct for population structure in our 
analysis. 

Pathways, and assignment of SNPs to pathways 

We aim for a comprehensive evaluation of pathways ac- 
cessible on t he W eb in standard, computer-readable for- 
mats [HI [5(3, Ill6| . Since the results hinge on the quality 
of the pathways used in our analysis, we restrict the anal- 
ysis to curated, peer-reviewed pathways based on experi- 
mental evidence, and pathways inferred via gene homology. 
We draw candidate pathways from the collections listed in 
Table |4] (see Supplementary Methods for details). KEGG 
and HumanCyc are primarily databases of metabolic path- 
ways, and are unlikely to be relevant to Crohn's disease, 
but we include them for completeness. 

Since wc combine pathways from different s ources, w e 
encounter pathways with inconsistent definitions [117L Ill8j . 
There is no single explanation for the lack of consensus 
in pathway definitions, and we have no reason to prefer 
one definition over another, so we include multiple ver- 
sions of a pathway in our analysis. Many of the pathways 
in these databases are arranged hierarchically; we incor- 
porate all elements of the hierarchy into our analysis. We 
treat each candidate pathway as a set of genes, ignoring 
details such as molecules involved in biochemical reactions, 
and cellular locations of these reactions. From 3788 to- 
tal pathways (Table H]), we obtain 3158 unique gene sets. 
With these pathways we achieve ~39% gene coverage (see 
Supplementary Results) . 

Based on findings that the majority of variants modu- 
lating gene expression lie within 100 kb of the gene's tran- 



10 



Carbonetto and Stephens: Integrated analysis of variants and pathways in genome-wide association studies 
Table 4. Overview of pathways used in the analysis 



database 


refs. 


download location 


^pathways 


BioCarta 
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1 dLIIWdy V.UI 1 1 1 1 lUl 13 
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HumanCyc 


109. 110 


NCBI BioSystems 

I d LI 1 Way V_UllllllUllo 


54 
224 


Kyoto Encyclopedia of Genes 
and Genomes (KEGG) 


111 


NCBI BioSystems 


399 


NCI Nature Pathway 
Interaction Database (PID) 


64 


NCBI BioSystems 
Pathway Commons 


186 
179 


PANTHER 


112. 113 


www . pantherdb . org 


128 


Reactome 


63 


NCBI BioSystems 
Pathway Commons 


1093 
1070 


WikiPathways 


114. 115 


NCBI BioSystems 


147 



From these 3788 pathways, we obtain 3158 unique gene sets. 



scribed region [l 191 - 41 2 lj ) , we assign a SNP to a gene if it is 
within 100 kb of the transcribed region. Others have opted 
for a 20 kb window (52l.[55| based on findings that cis-acting 
expression QTLs are rarely more than 20 kb from the gene 
[481 ] - We prefer a more inclusive mapping of SNPs to genes, 
since the benefit of including potentially relevant SNPs in a 
pathway when the association signal is sparse seems likely 
to outweigh the cost of including a larger number of irrele- 
vant markers. 

Statistical analysis 

The Bayesian variable selection approach to simultaneous 
interrogation of SNPs involves fitting the multi-marker dis- 
ease model to the data using different combinations of 
SNPs. Fitting all markers simultaneously weeds out mul- 
tiple associations at markers that are in linkage disequi- 
librium with one another, leaving only one marker for each 
independent association — an association that signals a vari- 
ant contributing to disease risk independently of other risk- 
conferring variants. 

Likelihood 

The likelihood specifies the probability of observing the 
phenotype observations y given the genotypes X, the in- 
tercept /?o, and the regression coefficients f3 = (J3i, . . . , /3 p ). 
From the additive model for the log-odds of disease (eq. [l| , 
Pi = V'(A) + Y^j=i x ij@j) i s the probability that y< = 1, in 
which ip(x) = 1/ (l+e~ x ) is the sigmoid function. Assuming 
independence of the observations y^, the likelihood is 

n 

p(y\X,/3 ,(3) = l[pr(l~p i ) 1 -y>. (4) 

i=l 

Priors 

Next we specify prior distributions for the genome- wide log- 
odds 8q , the enrichment parameter 9, the intercept /3n, and 
the coefficients /3j of SNPs included in the multi-marker 
model of disease risk. 

Since inferences strongly depend on 9o, an d since 9q is 
unknown and will be different for each setting, we estimate 
this parameter from the data rather than fix it a priori. 



Following [3(| |37j , we assign a uniform prior to 9q ■ We re- 
strict 6*o to [—6, —2], so as few as and as many as ~4400 
SNPs are expected to be included a priori. 

We place a uniform prior on 9, restricted to [0, 3]. This 
prior permits a wide range of enrichments because, in our 
view, enrichments greater than a thousand-fold arc unlikely 
to occur. Note that we do not allow negative enrichments; 
that is, we do not consider pathways that are underrep- 
resented for associations with the phenotype. Negative en- 
richments could potentially be useful in other scenarios, but 
for most genome-wide association studies where there are 
generally few significant associations to begin with, nega- 
tive pathway enrichments are difficult to find and are un- 
likely to have a useful interpretation. 

For the prior on the non-zero coefficients (5j , we follow a 
standard practice that assumes they are i.i.d. normal with 
zero mean and standard deviation a a [12 2| . 1 1 2 3j . Ordinarily, 
to combat sensitivity of the results to the choice of a a , we 
would place a prior on a a and integrate over this parameter 
to let the data drive selection of a a . This approach is taken 
in [3(| H3|- But in our case, we find that the heterogene- 
ity of the odds ratios in Crohn's disease presents a prob- 
lem: although we expect most odds ratios for a common 
disease — and specifically odds ratios in a pathway relevant 
to disease pathogenesis — to be close to 1, the odds ratios 
corresponding to the strongest Crohn's disease associations 
drive estimates of a a toward larger values, and a normal 
distribution that puts too little weight on modest odds ra- 
tios. One possible strategy would be to redo the analysis 
after removing associated regions with the largest odds ra- 
tios, but this is an unattractive solution because SNPs with 
large odds ratios would not contribute to the evidence for 
enrichment. Instead, we fix o~ a , grounding the choice on typ- 
ical odds ratios reported in published genome-wide associ- 
ation studies, and we assess the robustness of our findings 
to this choice. (There is the potential for more principled 
solutions; see Discussion.) Our choice is a a = 0.1, which 
favours odds ratios close to 1 (95% of the odds ratios lie 
between 0.82 and 1.22 a priori), while being large enough 
to capture a significant fraction of the odds ratios for com- 
mon alleles reported in genome-wide association studies of 
complex disease traits. According to a recent review fl5| . 
approximately 40% of estimated odds ratios are between 1.1 
and 1.2, and an additional 10% of odds ratios are smaller 
than 1.1. This prior also closely corresponds to a survey of 
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odds ratios re porte d in genetic association studies of com- 
mon diseases jl24j . Since there may be justification for a 
slightly smaller or slightly larger er a , we also try different 
values for a a , and examine how these choices affect the 
ranking of enriched pathways (see Results). 

To complete the probability model, we assign an im- 
proper uniform prior to the intercept, (3$. In general, one 
must be careful with use of improper priors in Bayesian 
variable selection because they can result in improper pos- 
teriors. A sufficient condition for a proper posterior, and a 
well-defined BF, with logistic regression (eq. [T]) is that the 
maximum likelihood estimator of (3 conditioned on which 
variables are included in the mod el, an d on the other model 
parameters, is unique and finite [125j | . Unfortunately, this 
condition is difficult to check exhaustively. But we can at 
least guarantee that the posterior is proper under the varia- 
tional approximation (see Supplementary Methods) so long 
as the coordinate ascent steps converge to a unique solu- 
tion. 

Bayes factors and posterior odds 

Ideally, we would assess evidence for an enrichment model 
by computing the posterior probability for that model. But 
computing these posterior probabilities is impractical for 
several reasons, one being that it would involve compu- 
tations for a large number of combinations of pathways. 
Instead, we compute a Bayes factor ^ for each candidate 
pathway, or combination of pathways, then we weigh the 
Bayes factor against the prior odds to obtain the posterior 
odds for the enrichment model: 

(posterior odds) (Bayes factor) (prior odds) 

p(fl>0|X,y,q) _ p(y|X,a,fl>0) pj6 > 0) 

p(9 = 0\X,y,a) p(y\X,a,e = 0) * P (9 = 0) ' U 

In our pathway enrichment results (Tables [T] and [2]) , we 
report BFs rather than posterior odds. Although our results 
would be more straightforward to interpret had we provided 
posterior odds instead of BFs, posterior odds are easily ob- 
tained from BFs, and reporting BFs offers the reader flexi- 
bility to judge the evidence based on his or her own prior. In 
the results, we illustrate how to gauge support for each en- 
richment hypothesis by weighing against a "conservative" 
prior. However, the reader may have reason to choose a 
different prior, such as one that favours certain pathways 
above others. 

To allow for uncertainty in 9q and 9 when evaluating 
the BFs, the likelihood under the enrichment hypothesis 
(9 > 0) and the likelihood under the null (9 = 0) are each 
expressed as an average over possible assignments to 9q and 
9: 

BV( , = JJp(y\^a,e ,9)p(9 )p(9)d9d9 

W Jp(y\X,a,9 ,9 = 0)p(9 )d9 ' 1 ' 

Each instance of p(y \ ~K.,a,9o,0) in (JSJ expands as an av- 
erage over possible assignments to the intercept (3q and re- 
gression coefficients (3: 

p(y | X, a, 9 , 6) = JJp(y | X, (3 , f3) p(f3 ) 
p 

x]Jp(p j \a j ,0 ,9)dp dp. (7) 



Factor p(/3j | q,,0 n,0) = 7^-^ (0, cr 2 ) + (l-7rj) S is the "spike 
and slab" prior jl22l Il23j |. in which -Kj = p((3j ^ 0) is 
determined according to @. Here, <5o denotes the delta 
mass, or "spike", at zero, N([i,<j 2 ) is the normal density 
with mean fi and variance a 2 , p(/3o) is the (improper) uni- 
form prior, and p(y | X, (3q, f3) is the logistic regression like- 
lihood (f?]). Computation of Bayes factors is described in 
Supplementary Methods. 

Posterior inclusion probabilities and other statistics 

Here we define PIPs and other posterior quantities used in 
our analysis for the case when a pathway, or combination 
of pathways, is enriched. Posterior statistics under the null 
hypothesis are obtained by setting 9 = 0. 

Like the BFs, the PIPs are obtained by averaging over 
9o and 9. Taking & = {X, y,a} as shorthand for all the 
data, we have 

PIP(j)=p(ft^0|^) 

= // p(ft ^O|^,0 O ,0) p(0 o , 9 | S) d9 Q d9. (8) 

To identify regions of the genome associated with dis- 
ease risk, we calculate, for each region, the posterior prob- 
ability that at least one SNP in the region is included in 
the multi-marker disease model (see Results for an expla- 
nation) . Let S — n represent the event that exactly n SNPs 
in a given region are included in the multi-marker disease 
model, so that Pi = p(S > 1 \ @). These posterior statistics 
are easily calculated from the PIPs using the variational 
approximation (see Supplementary Methods). 

Since no pair of pathways stands out in Table [2] as hav- 
ing greater support than any other pair, we compute the 
posterior statistics P\ by averaging over the different mod- 
els with two enriched pathways, including models with BFs 
too small to be included in the table, weighting these mod- 
els by their BFs. Implicitly, this assumes that all models 
with two enriched pathways are equally plausible a pri- 
ori. The ability to average across models in this way is 
an advantage of adopting the Bayesian approach to model 
comparison, because it allows us to assess genetic associ- 
ations in light of the enrichment evidence without having 
to choose a single pair of pathways. Suppose we have m 
enrichment models with corresponding Bayes 

factors BF(a^)), . . . ,BF(a' Tn '), then P\ for a given region 
is 

_ Ei: i ^>l|^« W )BF(qW) 

£™iESF(a«) ■ W 

Further details about computation of posterior statistics 
are given in Supplementary Methods. 
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Appendix A: Supplementary Methods 

Pathways 

We retrieve most of the pathways from the Pathway 
Commons (6(| and NCBI BioSystems [65[ repositories. 
From the Pathway Commons website, we download the 
October 26, 2011 version of Gene Matrix Transposed 
( .gmt) file for homo sapiens. To retrieve BioSystems path- 
ways, we first get the pathway names and IDs by searching 
for ' 'homo sapiens ' ' [organism] , then save the search re- 
sult as a CSV file. Next, we download the November 15, 
2011 version of the biosystems_gene file from the NCBI 
FTP site, which provides associations between genes and 
pathways. These two repositories include pathways from 
the same databases, but due to differences in versions of 
the databases and data processing procedure, there arc 
discrepancies among pathways. At present, BioSystems ig- 
nores nesting relationships between pathways in the PID. 
This can lead to large discrepancies in pathway gene sets, 
notably in the IL23-mediated signaling pathwayjj Since we 
cannot fully account for all discrepancies in the BioSystems 
and Pathway Commons gene sets, whenever there is dis- 
agreement we include both gene sets, and assess evidence 
for enrichment of these gene sets separately in our analysis. 

We download a version of the BioCarta database at 
www. openbioinf ormatics/gengen. We use this version of 
the BioCarta data because it was used in an previous path- 
way analysis of Crohn's disease (55| . We download version 
3.01 of the PANTHER "sequence association" file from 
their FTP site. From the sequence association file, we re- 
tain lines containing ENSG* accession numbers (correspond- 
ing to human genes) and remove entries that do not map 
to Entrez gene IDs. 

The numbers given in Table 2] are tabulated after dis- 
carding 213 pathways with less than two genes that map 
to the reference genome, and after removing 44 PID path- 
ways from Pathway Commons that contain over 500 genes 
because their definitions include a large number of nested 
pathways. We include all groups of pathways except for two 
unusually large gene sets from the KEGG database that 
are unions of related pathways, "metabolic pathways" and 
"pathways in cancer." 

Computation 

The main difficulty in computing the Bayes factor (eq. [3]) 
is the combinatorially large number of ways we can include 

1 Personal communication with Lewis Geer and Emek Demir. 



SNPs in the additive model of disease risk. In previous 
work J3(|, we described an approximation that yields an 
efficient procedure for computing the likelihood and PIPs. 
(Actually, this approximation was derived for the specific 
case when all variables have the same prior inclusion proba- 
bility, or when 9 = 0, but it is straightforward to extend this 
approximation to the more general case with 9 > 0.) Once 
we have a recipe for efficiently computing p(y | X, a, 9$, 9), 
we are left with the task of computing a one-dimensional 
integral in the denominator of ©, and a double integral in 
the numerator. Each of these integrals is then approximated 
using simple numerical integration techniques. 

The basic idea behind this approximation is to formu- 
late a lower bound to the likelihood, 

p(y\X,a,9 ,9)>e F ^' e °^\ (10) 

then to adjust the free parameters, denoted by <f>, so that 
this bound is as tight as possible. (The exact form of 
F{S>,9o,9,(j)) is derived in [3Q|, and is given below.) This 
lower bound is formulated by introducing a probability dis- 
tribution q((3; 4>) that approximates the posterior of /3 given 
9q and 9 so that maximizing the lower bound corresponds 
to finding the approximating distribution that best matches 
the posterior. More precisely, it amounts to searching for 
the free parameters (f> that minimize the Kullback-Leibler 
diverg ence between g(/3; <f>) and the posterior of /3 given 9 
and 9 [126| . The trick to making this approach tractable 
lies in forcing q{j3; <j)) to observe a simple conditional in- 
dependence property, as originally suggested by [4(| : each 
regression coefficient j3j is independent of the other coeffi- 
cients a posteriori given 9q and 9. In other words, we restrict 
this distribution to be of the form 

<z(/3;^) = n, p =i9(ft-;^), (ii) 

where <pj is the set of free parameters corresponding to the 
jth factor. 

For most SNPs, this conditional independence assump- 
tion is appropriate — most SNPs are unlinked because they 
are on separate chromosomes, or they are weakly linked 
because of recombination. In this case, the fully-factorized 
approximation g(/3; <j)) will closely recover the correct pos- 
terior distribution of the additive effects for these SNPs. 
But the conditional independence assumption is violated 
for SNPs in linkage disequilibrium. In that case, we do 
not expect to obtain accurate posterior statistics and, in 
practice, we find that the lower bound (|10p can be a poor 
substitute to the correct likelihood. However, we are inter- 
ested in accurate computation of BFs, not individual like- 
lihoods, so what matters is whether e F ^' e °' e '^ correctly 
captures the shape of the likelihood, or how the likelihood 
p(y | X, a, 9o, 9) changes as a function of 9q and 9; if the 
lower bound undershoots the exact likelihood by a constant 
factor across different settings of 9q and 9, this constant fac- 
tor will cancel out in the BF. In [36j], we show that the vari- 
ational approximation (when the phenotype is a quantita- 
tive trait) can closely reproduce the shape of the likelihood, 
and can give accurate estimates of some posterior quanti- 
ties, even when the conditional independence assumptions 
are not particularly appropriate. We caution, however, that 
the accuracy the approximation has only been assessed em- 
pirically, and we have no theoretical guarantees of its accu- 
racy. 
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Computing the Bayes factor for one pathway 

To start, we formulate a simple piecewise numerical approx- 
imation to the integrals in ([6]) based on Simpson's rule [1271 ] . 
We replace each instance of the likelihood p(y | X, a, do, 9) 
with its corresponding lower bound (|10j) . Following the dis- 
cussion above, we have 



^ ffe F WoMW)p(O o )p{O)d0d8 o 



J e F(S>^e=oMe ^o)) p (g ) d g Q ' 



so the numerical approximation to the BF is 



BF((i) 



3 alt 



^) 1 .«; 1 -e J! '(». fl o .*=<'^(»o 5 .»=o))p(^ <) ) 



, (13) 



where iUj and are weights obtained by applying 

Simpson's rule. To calculate the numerical approximation 
to the null likelihood, I nu \i, we evaluate the lower bound 
at equally spaced points over the interval [—6,-2]. The 
likelihood under the alternative is a double integral, so we 
evaluate the lower bound at points on a regular grid over 
the rectangular region 9 e [—6,-2], 9 G [0,3]. In eg. [T3l 
the free parameters </> are expressed as a function of 9q and 
9 because we adjust them separately for each setting of 
the hyperparameters (9q,9). (Optimizing <fi involves iterat- 
ing coordinate ascent steps until these steps converge to 
stationary point which constitutes a locally optimal bound 
to the likelihood. Full details about the procedure to solve 
for <f) are given in [36|. This procedure scales linearly with 
the number of samples and the number of SNPs.) Adjusting 
the free parameters for each setting of the hyperparameters 
can be an costly endeavor for a large problem, so to reduce 
the expense of computing the BF we formulate piecewise 
numerical approximations to the integrals using a small 
number of equally spaced points at intervals of length 0.25, 

so Of = -6, -5.75, . . . , -2 and 0« = 0, 0.25, . . . , 3. This 
coarse partitioning risks some loss of accuracy, especially if 
the posterior distribution of the hyperparameters is sharply 
peaked inside the subintervals, and a finer grid is certainly 
possible. An adaptive method that refines the subintervals 
in the pie cewise approximation could have been used in- 
stead [l27l | , but we stick to this simple scheme with equally 
spaced points at larger subintervals because it allows us to 
compute BFs for all ~3000 candidate pathways in a reason- 
able amount of time. 

The analytic expression for the lower bound to the log- 
likelihood is derived in the Appendix of [36| and, for con- 



venience, we reproduce it here: 

t 
2v 
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i=i 



+ fiuiVi - 1) + y T Xr - ±r T X T l>Xr 



^(X^X^Var^ 



3=1 



3 = 1 



l + log(4 



3 3 + V-j 



3 = 1 



£(1 -aj) log [- 



1 — a. 



(14) 



3 = 1 



U — uu / u, u is a column vector with 
\)/r]i : and U is the n X n matrix with 



where the sigmoid function ip(x) is defined above, and the 
prior inclusion probability 7Tj for each SNP is given by @ . 
For this expression we introduce the following definitions: 
ctj is the PIP for SNP j with respect to the approximating 
distribution g(/3;0), fij and are the approximate mean 
and variance of coefficient (3j conditioned on being included 
in the model, Var[/3j] = aj([Mj+s?) — (ajfJ,j) 2 is the variance 
of /3j with respect to the approximating distribution, r is 
a column vector with entries Tj = cxjfJ-j, 0o = l/y/ti is 
the standard deviation of the intercept /3q given (3, (3q = 
y/u is the posterior mode of the intercept (3q when (3 — 
0, (X. T U~K)jj is the jth diagonal entry of matrix product 
X T C/X, and we define u = Yh=i u u V = Y,i=i(Vi ~ §)> 

y = y-\-$ou, U 

entries Uj = (ip(r]i) ■ 
diagonal entries Uj. 

In [3^], to derive this analytic expression for the lower 
bound we made an additional approximation to the non- 
linear factors appearing in t he logistic regression likeli- 
hood p(y|X,/3 ,/9), following [TM Ell. This 

approxima- 
tion introduces an additional set of free parameters, 77 = 
(771 , . . . , rjn) , so implicitly the lower bound (|14[) is a func- 
tion of 77 as well. Like 0, we adjust rj separately for each 
hypcrparameter setting (9q,6). The procedure to solve for 
77 is given in [36|. 

Since the coordinate ascent updates used to solve for 
4> and 77 are only guaranteed to converge to a local maxi- 
mum of the lower bound, the choice of starting point can 
affect the tightness of the lower bound, and the quality of 
the approximation. As we explain in [36{ . this issue can 
be addressed somewhat by using a common initialization 
^(lmt)^(imt)^ £ or ^ ne coordinate ascent updates across all 

grid points (9 , 0^0), in which this initialization is selected 
by first running the coordinate ascent procedure separately 
for each grid point, with random initializations for <f> and 77, 



then assigning 



(init) 



'/ 



(init; 



) to the solution from the hy- 



pcrparameter setting with the largest marginal likelihood 
estimate. In practice, when we follow this procedure we find 
that final estimates of BFs and posterior statistics vary only 
slightly when the analysis is re-run with different random 
initializations. However, we cannot guarantee the possibil- 
ity that a new random starting point leads to the discovery 
of a much better approximation. 
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Computing the posterior inclusion probabilities and other pos- 
terior statistics 

In this section, we describe computation of PIPs and other 
posterior statistics when a pathway, or a combination of 
pathways, is enriched. Computation of these quantities un- 
der the null hypothesis proceeds in a similar manner by 
setting = 0. 

Following the procedure for computing the BFs, we for- 
mulate a piecewise numerical approximation to the integral 
in ©, substituting each PIP conditioned on 0o and with 
the corresponding variational approximation, au (0o , 0) « 
p(flk 7^ | 0, 6q,6). This yields the following approximate 
PIP: 

piP(k)^E l E J ^ 3 Mo^\o^), (15) 

where we define 

such that J2i Ej Wij = 1- Other posterior quantities are 
computed by averaging over 0o and in a similar way. For 
example, the posterior mean enrichment estimate is 



= E[9\9\ 

= Jf9p(9 ,9\£>) d9 d9 



Ei E,- Wij 6® 



(16) 



To compute credible intervals for 9, we add up the nor- 
malized weights Wij over successively wider intervals of 9, 
beginning at the posterior mean 9, until the sum of the nor- 
malized weights Wij reaches 0.95. As a result, the credible 
intervals are at the same resolution as the grid points used 
for the numerical approximation. 

The estimate of the posterior probability that at least 
one SNPs in a given segment of the genome is included in 
the additive model of disease risk is 



EiEitiij[i-p(s = o|#,0$°,0W)]. 



(17) 



where S = n represents the event that n SNPs in the seg- 
ment are included in the model. Assume without loss of 
generality that SNPs in the segment are labeled 1 through 
m. Since the regression coefficients are independent under 
the fully-factorized approximating distribution given 9q and 
9, we have 

p(S = Q\S>, 00,0) =p(Pi = OA ••■A/3 m =0| 0,0o,0) 

«nr=i(!-^o,0)), as) 

so our final estimate of this posterior statistic is 



Pi « e, Ei 1 - nr =1 (i - m$,w)) 



(19) 



To compute Pi = p(S > 2 | &) for a given segment, we 
observe that p(S > 2) = 1 — p(S = 1) — p(S = 0), and under 
the fully-factorized variational approximation, we have that 

p(S = l\9,9 ,9) 

= p(p x £ A ft = A • • • A p m = I 0, O , 0) + • • • 
+ p{J3 1 = A • • • A pm-x = A I3 rn £ | 0, O , 0) 



fe=i 



fe=l 



Ok 



(20) 



Scaling computation to many pathways 

Numerical integration together with the variational approx- 
imation makes it practical to compute BF(a) for one path- 
way, but computing BFs for thousands of pathways is still 
a costly undertaking. We introduce a simplifying assump- 
tion which, we show, yields substantial savings in compu- 
tation. We make the assumption that SNPs outside the en- 
riched pathway are unaffected by the pathway enrichment 
a posteriori. Formally, this means that p(/3 A \ @,0q,9) = 
p{j3 A I ®i ®o, = 0), where A is the set of SNPs assigned to 
the enriched pathway, and A is the remaining set of SNPs. 
In other words, the posterior distribution of the regression 
coefficients for SNPs outside the enriched pathway remains 
the same under the null and enrichment models. With this 
assumption, the posterior distribution of (3 given the hy- 
pcrparameters 0q and becomes 



p{p\9, 0o,0) 



p{/3 A \$>,8 ,e,l3 A ) 
xp(p A \@,9 Q ,9 



0). 



(21) 



This assumption amounts to conditioning on the additive 
effects of SNPs outside the enriched pathway. 

It is of course possible that SNPs contributing evidence 
for pathway enrichment are correlated with SNPs outside 
the pathway, invalidating this assumption. But because we 
assign SNPs to pathways in contiguous blocks (we anno- 
tate all SNPs within 100 kb of a gene in the pathway; see 
Methods), and because the way we assign SNPs to genes 
is not precise (many SNPs assigned to a gene are probably 
not relevant to the gene) , errors as a result of this assump- 
tion are expected to be minor compared to the imprecision 
in the SNP-pathway assignments. On the other hand, if we 
were to assign SNPs to pathways more precisely (e.g. a SNP 
known to modulate expression of a gene in the pathway), 
then this assumption would not be appropriate. 

Next, we show how this assumption allows us to reuse 
computations. This assumption implies that, for any SNP 
j that is not assigned to the enriched pathway, q((3j]ipj) 
remains the same under the null and enrichment models; 
that is, for any j ^ A, <f>j = <p*, in which q(/3;(f>) approxi- 
mates the posterior distribution of (3 given O , and = 0, 
and q(j3]4>*) approximates the posterior distribution of j3 
given 0o and > 0. From this result, the lower bound can 
be written as 

F(0,0o,0,<n = F(0,0 o ,0 = O,<£) 
+ F({X A ,£ Aj£ u = l},0o,0,<&) 
~F({X Al y Al a A = l},9 Ql 9 = 0,r A ), (22) 

where X A is the matrix of genotypes for SNPs assigned 
to the enriched pathway, <p A is the set of free parameters 
corresponding to SNPs j € A, a A is the set of pathway 
annotations restricted to SNPs j £ A (so a A is a vector of 
ones), and y A = y — UX A r A is the vector of binary labels 
"corrected" for SNPs outside the pathway, where X A is the 
genotype matrix for SNPs j ^ A, and r A is the vector r with 
entries rj = oijfij restricted to SNPs j ^ A. Note that (|2"2l 
is valid only if rj is held constant. 

Identity J22]) suggests a way to reuse our computations: 
once we have solved for 4>{8q 1 9 = 0), the free parameters 
4> that (locally) maximize the lower bound under the null 
hypothesis (0 = 0) for a given 0o, to solve for ^(0q,0) for 
any > 0, we only need to adjust the free parameters <pj 
corresponding to SNPs j in the enriched pathway. Crucially, 
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Fig. A.l. Panel A: histogram of gene set sizes for pathways used in the analysis. Panel B: histogram of the number of 
pathways assigned to genes. Panel C: histogram of the number of SNPs assigned to genes. These counts include 
multiple versions of pathways from Pathway Commons and BioSystems. 



r\ must be held constant in (f2"2")l , so for any 8 > 0, we set 
n(8o,8), the free parameters r\ adjusted for setting (80,8) 
of the hyperparameters, to n(8 ,8 = 0). 

Appendix B: Supplementary Results 

Pathway database statistics 

We observe a wide range in the number of genes assigned to 
each pathway (Fig. lA.lK ). Some of the larger gene sets are 
groups of related pathways in the Rcactome and PID hier- 
archies. Out of ~23,000 genes in the reference genome, 9054 
(~39%) are assigned to at least one pathway. The number 
of pathway assignments per gene varies widely (Fig. lA.lB ). 
Among genes assigned to at least one pathway, 95% are 
within 100 kb of a SNP, and 45% of SNPs are mapped 
to at least one gene in a pathway (Fig. IA.1C ). Figure [B~T1 
depicts the hierarchical relationships among pathways in 
Table □ 

More details on associations given enrichment of cytokine sig- 
naling genes 

Here we supply a few more details about regions of the 
genome that show strong support for association only after 
accounting for enrichment of cytokine signaling: the MHC 
class II region, the IBD5 locus, and a region at 17q21 near 
gene STAT3. Without pathway enrichment, the segment 
with the highest Pi at 17q21 is only 0.12, but this increases 
to 0.79 in the model in which cytokine signaling is enriched. 
This association was first identified in a meta-analysis @, 
and was later confirmed in Q. STAT3, the most compelling 
disease-susceptibility gene at this locus, plays a key ro le in 
Thl7 cell differentiation and IL-23 signaling [7(1 [lM Hp . 
It has also been identified as a risk factor for other autoim- 
mune diseases, including ulcerative colitis |94l 195). 

Enrichment of cytokine signaling also yields greater 
support for an association in the IBD5 region at position 
5q31, increasing the probability of an included SNP from 
Pi = 0.18 to Pi = 0.81. The IBD5 locus was first identified 
in a genome- wide study of individuals from Quebec [8!|, 
and this finding has si nce b een replicated and refined by 
several studies [1 0, US [H3- It was identified as a modest 



association for Crohn's disease in the original report (Table 
4 in H) with trend p-value 5.4 x 10~ 6 and BF = 10 4 54 . 
The IBD5 locus includes several candidate genes in a region 
of extensive linkage disequilibrium. Despite this, identifica- 
tion of this locus has suggested the contribution of gene 
variants affecting epithelia l barrier integrity in Crohn's dis- 
ease pathogenesis [l0|, Il33j . 

The MHC class II region, previously identified as a re- 
gion showing moderate evidence of association [51j . be- 
comes a stronger association when cytokine signaling is en- 
riched, as Pi increases from 0.49 to 0.98. The MHC is one of 
the most thoroughly studied regions of the human genome 
for its contribution to regulation of the immune system. 
Association of this locus (also known as IBD3) is widely 
repl icate d for Crohn's disease 0,183,111,1911, ulcerative col- 
itis Il34fl and, not surprisingly, many other autoimmune dis- 
eases [87], |135| . The extensive linkage disequilibrium across 
this gene-dense region complicates identification of disease- 
susceptibility variants; see [87] for a detailed examination 
of specific MHC genes contributing to Crohn's disease risk. 
We also note that evidence for association of the MHC class 
I genes, including TNF, is low under the null in our anal- 
ysis, with Pi = 0.08, and increases to Pi = 0.52 once we 
account for enrichment of cytokine signaling. 

Additional results of sensitivity analysis 

Under the null hypothesis, there is a clear trend in the 
overall effect of the choice of <j a on the distribution of asso- 
ciations; we observe that the posterior mean of the genome- 
wide log-odds 8 increases as a a decreases (Fig. IB. 2) . For 
instance, the normal prior with standard deviation 0.06 cor- 
responds to a posterior mean of 9q = —3.4 and prior inclu- 
sion probabilities of roughly 10 -3 4 w 0.00044 under the 
null hypothesis, which is more than double the proportion 
of SNPs that are independent associations (0.00018) when 
<r o =0.1. 

Fig. IB. 31 depicts the distribution of BFs for different 
choices of a a . For most of the candidate pathways, the sup- 
port for enrichment does not change much. 
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Fig. B.l. Hierarchical relationships among Reactome and PID pathways listed in Tabled] and elsewhere in our 
analysis. For example, "signaling by intcrlcukins" and "interferon signaling" (in addition to other pathways not shown) 
are part of "cytokine signaling in immune system." Ellipses (• • • ) indicate that there are intermediate entries in the 
hierarchy not shown. 
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Fig. B.2. Posterior mean of 9q for each choice of a a . Error bars depict 95% credible intervals. 
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Fig. B.3. Distribution of BFs for different settings of a a . 
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